Last week, I wrote a post on the new version of Google Analytics that drew a response from Avinash Kaushik, Web metrics guru and current Google Analytics evangelist. Avinash, well known for his blog and best-selling book, wrote to me to clarify some of what I explained last week. I thought his points were well taken, so I am revisiting the subject today.
Avinash wanted to clarify my discussion of Google Analytics’ ability to handle very high traffic volumes. I characterized Google’s approach as sampling that dropped data, but Avinash explained it more clearly:
For high-volume traffic, the Google Analytics Terms of Service (TOS) are very clear. Up to five million hits a month is OK for anyone. If you go beyond that (and five million is a huge number to hit!), the TOS states that you must be an AdWords customer (though no spending requirement is stated), and then the limit increases x times (x is not specified, but the limit becomes some multiple of the five million hits, which is a lot even if you multiply it by only two or five).
Perhaps the most important thought is that Google Analytics does not drop data. 100% of the data is collected, even if you send hundreds of millions of hits per day (which many sites I know do, and it seems crazy!).
There are two sampling scenarios:
1) The GA TOS states that, at Google’s discretion, it may ask you to sample the data, i.e., send it less data (though given that we are talking about millions of hits a day, the sample captured will still be more than statistically significant). All data will be processed and reported to you.
2) In the second scenario, again, 100% of the data is collected, but if you run very large queries over a very long time period (since Google never deletes your data, no matter how large), then that query will sample the stored data to ensure it actually returns results to you. Intelligent algorithms are applied to ensure that the results are statistically accurate.
If this happens, the Google Analytics report will clearly state that the data was sampled and it shows you the rate of sampling (in a little yellow square next to each metric).
Both of the above are precisely what all paid Web analytics tools do, too. Every vendor has contracted limits in terms of data you can send them. The more data you contract for, the more you pay.
If you breach your contract limits with any paid vendor, they will ask you to either sample the data at the collection point (your site), so you don’t send them all the data, or they will ask you to pay more to collect all the data. Even in the latter scenario, to have your queries return with results (and I say this from real experience) you will have to sample the data that has been collected.
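For readers who think better in code, here is a minimal Python sketch of the two sampling scenarios Avinash describes. It is my own illustration, not Google's implementation: the function names, the hash-based bucketing, and the fixed random seed are all assumptions made for the example.

```python
import hashlib
import random


def keep_visitor(visitor_id: str, rate: float) -> bool:
    """Scenario 1 (hypothetical): sample at the collection point.

    Hashing the visitor ID gives the same keep/drop decision on every
    page view, so a kept visitor's whole session is sent, not fragments.
    """
    digest = hashlib.sha256(visitor_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")   # integer in [0, 2**64)
    return bucket < int(rate * 2**64)            # keep roughly `rate` of visitors


def estimate_total(hits, rate: float, seed: int = 0) -> int:
    """Scenario 2 (hypothetical): sample stored data at query time.

    All hits are stored; the query only counts a random subset and
    scales the count back up by 1 / rate to estimate the full total.
    """
    rng = random.Random(seed)                    # fixed seed keeps the sketch repeatable
    sampled = sum(1 for _ in hits if rng.random() < rate)
    return round(sampled / rate)


if __name__ == "__main__":
    # Send data for about one visitor in four.
    print(keep_visitor("visitor-42", 0.25))
    # Estimate a count over 100,000 stored hits from a 10% sample.
    print(estimate_total(range(100_000), 0.10))
```

In both cases, the sampling rate is the number you would want a report to show you, because the estimate is only as trustworthy as the size of the sample behind it.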
Avinash also thought that I should clarify when support for the old version will end. He rightly points out that Google has announced no end date for support, but I seized on its statement that support will last “at least 12 to 18 months” when I advised folks that they might want to consider moving this year. It’s certainly possible that Google will be supporting the old code even several years from now. Avinash advises existing Google Analytics users to switch to the new version if they need the new features, but not to worry about it too much if they don’t.
Thanks, Avinash, for helping my readers with this important decision.